Many recent studies have found that marital status increasingly correlates with socioeconomic status (UPenn Wharton Budget Model, Pew Research, American Enterprise Institute). In particular, these studies have shown that those who get married are increasingly high-income, college-educated individuals who have the financial stability to support a family.
I will use individual-level Census PUMS data to see if these trends also hold in the Bay Area. In this section, I study the relationship between marital status (which I set as my dependent variable) and income and educational attainment (which I set as my independent variables) in the Bay Area. This analysis will be at the PUMA level. I will focus on individuals aged 18 or older, since individuals under 18 rarely marry.
ca_pums = get_pums(variables = c("PUMA", "AGEP", "MAR", "SCHL", "PINCP"),
state = "CA",
year = 2019,
survey = "acs1",
recode = T)
saveRDS(ca_pums, file = "ca_pums.rds")
# Get PUMS and PUMAS data
pums_vars_2019 = pums_variables %>%
filter(year == 2019, survey == "acs1")
pums_vars_2019_inds = pums_vars_2019 %>%
distinct(var_code, var_label, data_type, level) %>%
filter(level == "person")
ca_pums = readRDS("ca_pums.rds")
ca_pumas = pumas("CA", cb = T, progress_bar = F)
# Isolate Bay Area PUMS
bay_county_names =
c(
"Alameda",
"Contra Costa",
"Marin",
"Napa",
"San Francisco",
"San Mateo",
"Santa Clara",
"Solano",
"Sonoma"
)
ca_counties = counties("CA", cb = T, progress_bar = F)
bay_counties = ca_counties %>%
filter(NAME %in% bay_county_names)
bay_pumas = ca_pumas %>%
st_centroid() %>%
.[bay_counties, ] %>%
mutate(PUMACE10 = as.numeric(PUMACE10)) %>%
st_set_geometry(NULL) %>%
left_join(ca_pumas %>% select(GEOID10)) %>%
st_as_sf()
bay_pums = ca_pums %>%
mutate(PUMA = as.numeric(PUMA)) %>%
filter(PUMA %in% bay_pumas$PUMACE10)
bay_pums
# Clean dataset
# Filter out all children under age 3 (SCHL == bb)
# Convert "character"-encoded variables MAR and SCHL to doubles
# Filter dataset so all individuals are over 18 years old
# Create new variable "married" representing number of ever-married individuals in each PUMA
# Create new variable "college" representing number of college-educated individuals in each PUMA
# Collapse to PUMAs
# Calculate percentage of married and college-educated people in each PUMA
# Additionally calculate weighted mean income in each PUMA
bay_marriage = bay_pums %>%
filter(SCHL != "bb") %>%
mutate(MAR = as.numeric(MAR),
SCHL = as.numeric(SCHL)) %>%
filter(AGEP >= 18) %>%
mutate(married = ifelse(
(MAR < 5),
PWGTP,
0),
college = ifelse(
(SCHL >= 21),
PWGTP,
0
)
) %>%
group_by(PUMA) %>%
summarize(mean_personal_income = weighted.mean(PINCP, PWGTP, na.rm = T),
perc_married = sum(married, na.rm = T)/
sum(PWGTP, na.rm = T)*100,
perc_college = sum(college, na.rm = T)/
sum(PWGTP, na.rm = T)*100)
bay_marriage
bay_marriage_map = bay_marriage %>%
left_join(bay_pumas %>% select(PUMACE10),
by = c("PUMA" = "PUMACE10")) %>%
st_as_sf()
marriage_pal = colorNumeric(palette = "Oranges",
domain = bay_marriage_map$perc_married)
leaflet() %>%
addTiles() %>%
addPolygons(data = bay_marriage_map,
fillColor = ~marriage_pal(perc_married),
color = "white",
opacity = 0.5,
fillOpacity = 0.5,
weight = 1,
label = ~paste0(round(perc_married),
"% married individuals"),
highlightOptions = highlightOptions(weight = 2,
opacity = 1)) %>%
addLegend(data = bay_marriage_map,
pal = marriage_pal,
values = ~perc_married,
title = "Percentage of population<br>that is or has been married")
Most PUMAs are majority-married people. The exceptions are San Francisco and Berkeley, but these are understandable exceptions — San Francisco is known for having a high population of young single tech workers, while Berkeley has a large population of young single recent college graduates.
# Regression
model = lm(perc_married ~ mean_personal_income, bay_marriage)
summary(model)
##
## Call:
## lm(formula = perc_married ~ mean_personal_income, data = bay_marriage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.524 -2.357 1.993 5.483 14.192
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.285e+01 3.725e+00 16.873 <2e-16 ***
## mean_personal_income 5.933e-05 4.962e-05 1.196 0.237
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.471 on 53 degrees of freedom
## Multiple R-squared: 0.02627, Adjusted R-squared: 0.007898
## F-statistic: 1.43 on 1 and 53 DF, p-value: 0.2371
# Scatterplot
ggplot(data = bay_marriage,
aes(x = mean_personal_income,
y = perc_married)) +
geom_point() +
geom_smooth(method = "lm")
From the regression, it doesn’t seem that personal income itself has an impact on the likelihood of marriage in any given census tract. (An increase in $10,000 in income corresponds to a 0.5 percentage point increase in marriage rate, a small increase indeed.) A scatterplot seems to show a positive correlation between personal income and marriage rates, but the margin of error for this line is high — it’s possible to draw a straight line through the area covered by the margin of error.
# Regression
model = lm(perc_married ~ perc_college, bay_marriage)
summary(model)
##
## Call:
## lm(formula = perc_married ~ perc_college, data = bay_marriage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.4493 -3.0704 0.9118 5.3427 16.4459
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67.024808 3.568125 18.784 <2e-16 ***
## perc_college 0.001355 0.073618 0.018 0.985
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.584 on 53 degrees of freedom
## Multiple R-squared: 6.394e-06, Adjusted R-squared: -0.01886
## F-statistic: 0.0003389 on 1 and 53 DF, p-value: 0.9854
# Scatterplot
ggplot(data = bay_marriage,
aes(x = perc_college,
y = perc_married)) +
geom_point() +
geom_smooth(method = "lm")
It also appears from the regression that having a college education does not have an impact on marriage rates. The scatterplot shows an essentially flat line with a large margin of error, suggesting that it is unlikely the two are related.
# Correlation plot
correlationplot = bay_marriage %>%
select(mean_personal_income,
perc_college,
perc_married) %>%
cor()
corrplot(correlationplot,
method = "number",
type = "upper")
Before interpreting regression results, it is important to note that college education and income are highly correlated, as shown in the correlation plot above. Thus there is bound to be collinearity in this regression, which undermines the model’s predictive power.
# Regression
model = lm(perc_married ~ mean_personal_income + perc_college, bay_marriage)
summary(model)
##
## Call:
## lm(formula = perc_married ~ mean_personal_income + perc_college,
## data = bay_marriage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.4374 -2.4510 0.6059 4.4006 14.0392
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.6079529 3.5538365 17.898 <2e-16 ***
## mean_personal_income 0.0003123 0.0001090 2.865 0.0060 **
## perc_college -0.4108149 0.1595885 -2.574 0.0129 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.054 on 52 degrees of freedom
## Multiple R-squared: 0.1363, Adjusted R-squared: 0.1031
## F-statistic: 4.104 on 2 and 52 DF, p-value: 0.02213
When regressing both together, it seems that both factors have a significant impact on marriage rates. Increases in mean personal income correspond to increases in marriage rates, as predicted in my hypothesis. Contrary to what previous researchers find, increases in college attendance correspond to decreases in marriage rates. However, this effect seems to be less pronounced than the income effect.
One possible explanation for this discrepancy is that unlike in most of the U.S., low-income people in the Bay Area are predominantly immigrants. Immigrants across the socioeconomic spectrum tend to be married at much higher rates than American-born citizens, according to the American Enterprise Institute and the Institute for Family Studies. Since so many low-income people in the Bay Area are immigrants, it is possible that they are raising marriage rates among less-educated low-income people even higher than marriage rates among college-educated people.
Another possible explanation is that there are enough young single college graduates in San Francisco that they weaken the relationship between college education and marriage rates. As shown in the map below, all areas where the marriage rate is under 50% are in San Francisco, Berkeley, Oakland, and Emoryville. The San Francisco areas overlap partly (though not entirely) with rapidly gentrifying areas as indicated in a map on this blog, while Berkeley is likely to be populated by a number of recent college graduates who are unmarried. Thus it is possible that an unusually large number of unmarried recent college graduates is also “flattening” the relationship between college education and marriage in the Bay Area.
bay_marriage_map_test = bay_marriage %>%
filter(perc_married < 50) %>%
left_join(bay_pumas %>% select(PUMACE10),
by = c("PUMA" = "PUMACE10")) %>%
st_as_sf()
leaflet() %>%
addTiles() %>%
addPolygons(data = bay_marriage_map_test,
color = "blue",
opacity = 1,
fillOpacity = 0.2,
weight = 4,
label = ~paste0(round(perc_married),
"% married individuals"),
highlightOptions = highlightOptions(weight = 4,
fillOpacity = 0.5))
According to the CDC and a study published in Sleep Medicine, poorer people tend to sleep less than richer people. Additionally, a recent meta-analysis in Sleep Health finds that white children and adolescents tend to sleep more than their non-white peers. Intuitively, it makes sense that poorer people sleep less, since they may work longer hours and undergo more stresses related to financial insecurity. In the Bay Area specifically, lower-income people tend to live farther from their workplace than rich people due to high housing costs in the Bay, so long commute times may also contribute to sleep deprivation. Since people of color, especially Black and Hispanic/Latinx people, are disproportionately likely to be poor, it also makes sense that race correlates with sleep deprivation.
In this section, I investigate whether this relationship also exists in the Bay Area. I conduct a Census tract-level analysis using data from the CDC’s 500 Cities project and the 2018 American Community Survey 5-year estimates.
all_health = read_csv("https://chronicdata.cdc.gov/api/views/6vp6-wxuq/rows.csv?accessType=DOWNLOAD")
saveRDS(all_health, file = "all_health.rds")
all_health = readRDS("all_health.rds")
ca_health = filter(all_health, StateAbbr == "CA")
ca_sleep = filter(ca_health, MeasureId == "SLEEP")
# Filter to Bay Area census tracts
ca_tracts = tracts("CA", cb = T, progress_bar = F)
bay_sleep = ca_sleep %>%
filter(!is.na(TractFIPS)) %>%
filter(!is.na(Data_Value)) %>%
mutate(perc_sleepdep = Data_Value) %>%
left_join(ca_tracts %>% select(GEOID),
by = c("TractFIPS" = "GEOID")) %>%
st_as_sf() %>%
st_centroid() %>%
.[bay_counties, ] %>%
st_set_geometry(NULL) %>%
left_join(ca_tracts %>% select(GEOID),
by = c("TractFIPS" = "GEOID")) %>%
st_as_sf()
bay_sleep
sleep_pal = colorNumeric(palette = "Purples",
domain = bay_sleep$perc_sleepdep)
leaflet() %>%
addTiles() %>%
addPolygons(
data = bay_sleep,
fillColor = ~sleep_pal(perc_sleepdep),
color = "white",
opacity = 0.5,
fillOpacity = 0.75,
weight = 1,
label = ~paste0(round(perc_sleepdep), "% in ", TractFIPS),
highlightOptions = highlightOptions(weight = 2,
opacity = 1)) %>%
addLegend(data = bay_sleep,
pal = sleep_pal,
values = ~perc_sleepdep,
title = "Percentage of residents<br>who sleep less than 7<br>hours per night")
We use ACS 2018 5-year data because tract-level data is only available in the 5-year surveys.
acs_vars_2018_5yr = listCensusMetadata(name = "2018/acs/acs5",
type = "variables")
bay_income_race = getCensus(name = "acs/acs5",
vintage = 2018,
region = "tract:*",
regionin = "state:06+county:001,013,041,055,075,081,085,095,097",
vars = c("B19001A_001E",
"B19001_001E",
"B19013_001E")) %>%
transmute(tract = paste0(state, county, tract),
perc_white = B19001A_001E / B19001_001E * 100,
med_income = B19013_001E) %>%
filter(!is.na(perc_white),
!is.na(med_income))
bay_income_race
bay_sleep_income_race = bay_income_race %>%
left_join(bay_sleep %>% select(TractFIPS, perc_sleepdep),
by = c("tract" = "TractFIPS")) %>%
filter(!is.na(perc_sleepdep))
bay_sleep_income_race
In plotting the relationship between income and sleep deprivation, it becomes obvious that there are some outliers in the income variable.
# Scatterplot
ggplot(data = bay_sleep_income_race,
aes(x = med_income,
y = perc_sleepdep)) +
geom_point() +
geom_smooth(method = "lm")
# Finding the outliers
outlier_exam = bay_sleep_income_race %>%
st_as_sf() %>%
arrange(med_income) %>%
.[c(1,2),]
outlier_exam
# Mapping the outliers
outlier_exam %>% leaflet() %>%
addTiles() %>%
addPolygons()
As it turns out, the outliers are two non-residential areas in San Francisco (Golden Gate Park and an industrial area around Islais Creek). The encoding suggests that income is not available in those two areas. I drop those two values and use this adjusted dataset to perform all further analyses.
bay_sleep_income_race = bay_sleep_income_race %>%
filter(med_income > 0)
# Regression
model = lm(perc_sleepdep ~ med_income, bay_sleep_income_race)
summary(model)
##
## Call:
## lm(formula = perc_sleepdep ~ med_income, data = bay_sleep_income_race)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.398 -1.873 -0.004 1.725 10.063
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.978e+01 2.231e-01 178.30 <2e-16 ***
## med_income -6.216e-05 2.026e-06 -30.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.795 on 1092 degrees of freedom
## Multiple R-squared: 0.4629, Adjusted R-squared: 0.4624
## F-statistic: 941 on 1 and 1092 DF, p-value: < 2.2e-16
# Scatterplot
ggplot(data = bay_sleep_income_race,
aes(x = med_income,
y = perc_sleepdep)) +
geom_point() +
geom_smooth(method = "lm")
Now that the outliers have been removed, it is clear that there is a negative association between sleep deprivation and income. Poorer people are more likely to be sleep-deprived; richer people are less likely. This relationship is significant.
# Regression
model = lm(perc_sleepdep ~ perc_white, bay_sleep_income_race)
summary(model)
##
## Call:
## lm(formula = perc_sleepdep ~ perc_white, data = bay_sleep_income_race)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1006 -2.2562 -0.4398 2.2207 8.6396
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.238804 0.235387 166.70 <2e-16 ***
## perc_white -0.112826 0.004236 -26.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.969 on 1092 degrees of freedom
## Multiple R-squared: 0.3938, Adjusted R-squared: 0.3933
## F-statistic: 709.4 on 1 and 1092 DF, p-value: < 2.2e-16
# Scatterplot
ggplot(data = bay_sleep_income_race,
aes(x = perc_white,
y = perc_sleepdep)) +
geom_point() +
geom_smooth(method = "lm")
There is a negative association between the percent of white people in a tract and the percent of sleep-deprived individuals. This relationship is also significant.
# Correlation plot
correlationplot = bay_sleep_income_race %>%
select(perc_sleepdep,
med_income,
perc_white) %>%
cor()
corrplot(correlationplot,
method = "number",
type = "upper")
There is a relationship between median income and percentage of white people in each census tract, but the relationship is not strong enough that I am concerned about collinearity for this analysis.
# Regression
model = lm(perc_sleepdep ~ med_income + perc_white, bay_sleep_income_race)
summary(model)
##
## Call:
## lm(formula = perc_sleepdep ~ med_income + perc_white, data = bay_sleep_income_race)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6363 -1.5603 -0.0685 1.5055 7.7371
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.260e+01 2.159e-01 197.30 <2e-16 ***
## med_income -4.866e-05 1.737e-06 -28.02 <2e-16 ***
## perc_white -8.169e-02 3.417e-03 -23.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.265 on 1091 degrees of freedom
## Multiple R-squared: 0.6475, Adjusted R-squared: 0.6469
## F-statistic: 1002 on 2 and 1091 DF, p-value: < 2.2e-16
Echoing the single regression results above, both income and race have a significant impact on sleep deprivation rates. An income increase in $10,000 corresponds to a 0.4 percentage point decrease in sleep deprivation rates; a 1 percentage point increase in the percentage of white people corresponds to a 0.08 percentage point decrease in the percentage of sleep deprivation rates.
Overall, national trends in sleep deprivation are echoed in the Bay Area.